TALP: Xgram-based spoken language translation system

Authors

  • Adrià de Gispert
  • José B. Mariño
Abstract

This paper introduces TALP, a speech-to-speech statistical machine translation system developed at the TALP Research Center (Barcelona, Spain). TALP generates translations by searching for the best-scoring path through a Finite-State Transducer (FST) that models an Xgram of the bilingual language defined by tuples. The paper gives a detailed description of the system and of the core processes used to train it from a parallel corpus. Results on the Chinese-English supplied task of the International Workshop on Spoken Language Translation (IWSLT'04) Evaluation Campaign are shown and discussed.

1. Overview of the system

TALP (Traducció Automàtica del Llenguatge Parlat) is a speech-to-speech statistical machine translation system developed at the TALP Research Center (Barcelona, Spain) over the last few years. It implements an integrated architecture that joins speech recognition and translation in a single step. Mathematically, the system produces a translation by maximizing the joint probability of the source and target languages, which is equivalent to a language model of a special language whose units are bilingual (called tuples). TALP implements this tuple language model by means of a Finite-State Transducer (FST) with an Xgram memory, that is, a variable-length N-gram model that adapts its length to the evidence in the data. Xgrams have given good results in speech recognition tasks in the past [1].

Given such a bilingual FST, the search for a translation becomes the search for the best-scoring path along the transducer's edges. This search can be performed by dynamic programming, using well-known decoding techniques from the speech recognition domain. The Viterbi algorithm with a beam search can thus be run forwards taking only source-language words into account (the first part of each tuple), and the target-language words are read during trace-back to produce the translation.
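The search described above can be illustrated with a small Python sketch. It is not the TALP decoder: it assumes text input (no acoustic models), a fixed bigram history over tuples instead of an Xgram, and a hand-written toy model. The names `viterbi_translate`, `toy_bigram_logprob`, `START` and all probabilities are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical sentence-start tuple: (source words, target string).
START = (("<s>",), "<s>")

def viterbi_translate(source_words, tuples, bigram_logprob, beam=10):
    """Monotone Viterbi search over a tuple bigram model (illustrative only).

    Going forwards, only the source side of each tuple is matched against
    the input; target sides are read during trace-back.
    """
    n = len(source_words)
    # chart[i] maps the last tuple used -> (best log-score, backpointer)
    chart = [dict() for _ in range(n + 1)]
    chart[0][START] = (0.0, None)
    for i in range(n):
        # Beam pruning: expand only the `beam` best hypotheses ending at i.
        survivors = sorted(chart[i].items(),
                           key=lambda kv: kv[1][0], reverse=True)[:beam]
        for prev, (score, _) in survivors:
            for tup in tuples:
                src_side = tup[0]
                j = i + len(src_side)
                if tuple(source_words[i:j]) != src_side:
                    continue  # source side does not match the input here
                new_score = score + bigram_logprob(prev, tup)
                if tup not in chart[j] or new_score > chart[j][tup][0]:
                    chart[j][tup] = (new_score, (i, prev))
    if not chart[n]:
        return None, float("-inf")  # input cannot be covered by the tuples
    state = max(chart[n], key=lambda t: chart[n][t][0])
    best_score = chart[n][state][0]
    # Trace-back: collect target sides from the end to the start.
    pos, target = n, []
    while state != START:
        target.append(state[1])
        pos, state = chart[pos][state][1]
    return " ".join(reversed(target)), best_score

# Toy Spanish-to-English tuples and made-up bigram probabilities.
QUIERO = (("quiero",), "I want")
UNA = (("una",), "a")
HAB = (("habitación",), "room")
DOBLE = (("doble",), "double")
HAB_DOBLE = (("habitación", "doble"), "double room")
TUPLES = [QUIERO, UNA, HAB, DOBLE, HAB_DOBLE]
PROBS = {
    (START, QUIERO): 0.9, (QUIERO, UNA): 0.9,
    (UNA, HAB_DOBLE): 0.8, (UNA, HAB): 0.2, (HAB, DOBLE): 0.5,
}

def toy_bigram_logprob(prev, cur):
    # Unseen bigrams get a small constant probability (an assumption).
    return math.log(PROBS.get((prev, cur), 1e-4))

translation, score = viterbi_translate(
    "quiero una habitación doble".split(), TUPLES, toy_bigram_logprob)
print(translation, score)
```

Because the multi-word tuple carries its own target word order, the monotone search can still output a locally reordered phrase, which is why tuples have to respect word order in both languages.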
Using the same structure and search method, acoustic models can be omitted to perform text translation tasks only.

Figure 1: A translation FST from Spanish to English

This translation FST is learned automatically from a parallel corpus in three main steps (plus an optional preprocessing step). First, an automatic word alignment is produced. Currently this is done with the freely available GIZA++ software [2], which implements the well-known IBM and HMM translation models [3, 4]. From this alignment, a tuple extraction algorithm generates the set of tuples that induces a sequential segmentation of both source and target sentences. These tuples must respect word order in both languages, as this is necessary for the transducer to produce translated output in the correct order. Finally, Xgrams are learned using standard language modeling techniques. Previous publications on this system include [5] and [6].

The organization of the paper is as follows. Section 1 offers an overview of the system architecture, whereas sections 2 and 3 go into detail on translation generation and training issues. Section 4 presents the experimental framework used to evaluate the system, whose results are discussed in section 5. Finally, section 6 concludes and outlines future lines of research.

This work has been partially supported by the Spanish Government under grant TIC2002-04447-C02 (ALIADO project), the European Union under grant FP6-506738 (TC-STAR project) and the Dep. of Universities, Research and Information Society (Generalitat de Catalunya).

2. Translation generation

Statistical machine translation is based on the assumption that every sentence e in the target language is a possible translation of a given sentence f in the source language. The main difference between two possible translations of a given sentence is the probability assigned to each, which is to be learned from a bilingual text corpus. This probability can be modeled by a joint probability model of source and target languages.
In this case, solving the translation problem means finding the target-language sentence that maximises equation 1. This probability can be approximated by an Xgram of a joint (bilingual) language model, learned from a set of tuples, as expressed in equation 2.

$$\hat{e} = \arg\max_{e} \{ p(e, f) \} = \cdots \qquad (1)$$
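To make the tuple-based joint model more concrete, the following Python sketch segments word-aligned sentence pairs into monotone tuples and scores a tuple sequence with a bigram model. It is a simplification under stated assumptions, not the system's training code: the alignments are written by hand rather than produced by GIZA++, a fixed-order bigram with add-one smoothing stands in for the Xgram, and the handling of unaligned words is one arbitrary choice. The function names `extract_tuples`, `train_bigram` and `joint_logprob` are hypothetical.

```python
import math
from collections import Counter

START = ("<s>", "<s>")  # hypothetical sentence-start tuple

def extract_tuples(src, tgt, links):
    """Cut an aligned sentence pair into the smallest monotone tuples.

    `links` is a list of (source index, target index) alignment points.
    A cut is placed after source word i only if no link crosses it.
    """
    tuples, s_start, t_start = [], 0, 0
    for i in range(len(src)):
        if i == len(src) - 1:
            j = len(tgt) - 1  # last tuple takes all remaining target words
        else:
            left = [t for s, t in links if s <= i]
            right = [t for s, t in links if s > i]
            if not left:
                continue  # nothing aligned yet: keep growing the tuple
            j = max(left)
            if j < t_start:
                continue  # tuple would have an empty target side
            if right and min(right) <= j:
                continue  # a link crosses the cut: word order not respected
        tuples.append((" ".join(src[s_start:i + 1]),
                       " ".join(tgt[t_start:j + 1])))
        s_start, t_start = i + 1, j + 1
    return tuples

def train_bigram(tuple_sequences):
    """Collect history and bigram counts over tuple sequences."""
    history, bigram, vocab = Counter(), Counter(), set()
    for seq in tuple_sequences:
        padded = [START] + list(seq)
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            bigram[(prev, cur)] += 1
    return history, bigram, len(vocab)

def joint_logprob(seq, history, bigram, vocab_size):
    """log p(e, f) of a tuple sequence, add-one smoothed (an assumption)."""
    padded = [START] + list(seq)
    return sum(math.log((bigram[(p, c)] + 1) / (history[p] + vocab_size))
               for p, c in zip(padded, padded[1:]))

# Two hand-aligned toy sentence pairs (Spanish source, English target).
corpus = [
    ("quiero una habitación doble".split(), "I want a double room".split(),
     [(0, 0), (0, 1), (1, 2), (2, 4), (3, 3)]),
    ("quiero una habitación".split(), "I want a room".split(),
     [(0, 0), (0, 1), (1, 2), (2, 3)]),
]
segmented = [extract_tuples(s, t, a) for s, t, a in corpus]
history, bigram, vocab_size = train_bigram(segmented)
print(segmented[0])
print(joint_logprob(segmented[0], history, bigram, vocab_size))
```

In the first pair the crossing links between "habitación doble" and "double room" prevent a cut inside the phrase, so both words end up in a single tuple; this is how the segmentation keeps the tuple sequence monotone in both languages.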

Similar resources

The TALP ngram-based SMT system for IWSLT'05

This paper provides a description of TALP-Ngram, the tuple-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya). Briefly, the system performs a log-linear combination of a translation model and additional feature functions. The translation model is estimated as an N-gram of bilingual units called tuples, and the fea...

TALP phrase-based system and TALP system combination for IWSLT 2006

This paper describes the TALP phrase-based statistical machine translation system, enriched with the statistical machine reordering technique. We also report the combination of this system and the TALP-tuple, the n-gram-based statistical machine translation system. We report the results for all the tasks (Chinese, Arabic, Italian and Japanese to English) in the framework of the third evaluation...

The TALP n-gram-based SMT system for IWSLT 2006

This paper describes TALPtuples, the 2006 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years, which are highlighted and empirically compared. Mainly, these include a novel and much more efficient word ordering strategy ba...

The TALP n-gram-based SMT system for IWSLT 2007

This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neur...

The TALP-UPC Spanish-English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System

This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules prog...

Publication date: 2004